1.2 Main Question
How to effectively visualize a lot of variables in order to gain some meaningful insights from the data?
library(ggplot2)
library(dplyr)
library(tidyverse)
library(plotly)
The variables can be split into the following fields:
How to effectively visualize a lot of variables in order to gain some meaningful insights from the data?
MUSIC PREFERENCES
Since there are so many fields and many variables, I took a specific field to analyze and see how can I represent all the variables into only one visualization in order to gain some insights from that specific field.
The chosen field is: MUSIC
data = read.csv("responses.csv")
columns = read.csv("columns.csv")
head(data)
See the columns data
head(columns)
Firstly, we will take a look at the type of our data
str(data)
## 'data.frame': 1010 obs. of 150 variables:
## $ Music : int 5 4 5 5 5 5 5 5 5 5 ...
## $ Slow.songs.or.fast.songs : int 3 4 5 3 3 3 5 3 3 3 ...
## $ Dance : int 2 2 2 2 4 2 5 3 3 2 ...
## $ Folk : int 1 1 2 1 3 3 3 2 1 5 ...
## $ Country : int 2 1 3 1 2 2 1 1 1 2 ...
## $ Classical.music : int 2 1 4 1 4 3 2 2 2 2 ...
## $ Musical : int 1 2 5 1 3 3 2 2 4 5 ...
## $ Pop : int 5 3 3 2 5 2 5 4 3 3 ...
## $ Rock : int 5 5 5 2 3 5 3 5 5 5 ...
## $ Metal.or.Hardrock : int 1 4 3 1 1 5 1 1 5 2 ...
## $ Punk : int 1 4 4 4 2 3 1 2 1 3 ...
## $ Hiphop..Rap : int 1 1 1 2 5 4 3 3 1 2 ...
## $ Reggae..Ska : int 1 3 4 2 3 3 1 2 2 4 ...
## $ Swing..Jazz : int 1 1 3 1 2 4 1 2 2 4 ...
## $ Rock.n.roll : int 3 4 5 2 1 4 2 3 2 4 ...
## $ Alternative : int 1 4 5 5 2 5 3 1 NA 4 ...
## $ Latino : int 1 2 5 1 4 3 3 2 1 5 ...
## $ Techno..Trance : int 1 1 1 2 2 1 5 3 1 1 ...
## $ Opera : int 1 1 3 1 2 3 2 2 1 2 ...
## $ Movies : int 5 5 5 5 5 5 4 5 5 5 ...
## $ Horror : int 4 2 3 4 4 5 2 4 1 2 ...
## $ Thriller : int 2 2 4 4 4 5 1 4 5 1 ...
## $ Comedy : int 5 4 4 3 5 5 5 5 5 5 ...
## $ Romantic : int 4 3 2 3 2 2 3 2 4 5 ...
## $ Sci.fi : int 4 4 4 4 3 3 1 3 4 1 ...
## $ War : int 1 1 2 3 3 3 3 3 5 3 ...
## $ Fantasy.Fairy.tales : int 5 3 5 1 4 4 5 4 4 4 ...
## $ Animated : int 5 5 5 2 4 3 5 4 4 4 ...
## $ Documentary : int 3 4 2 5 3 3 3 3 5 4 ...
## $ Western : int 1 1 2 1 1 2 1 1 1 1 ...
## $ Action : int 2 4 1 2 4 4 2 3 1 2 ...
## $ History : int 1 1 1 4 3 5 3 5 3 3 ...
## $ Psychology : int 5 3 2 4 2 3 3 2 2 2 ...
## $ Politics : int 1 4 1 5 3 4 1 3 1 3 ...
## $ Mathematics : int 3 5 5 4 2 2 1 1 1 3 ...
## $ Physics : int 3 2 2 1 2 3 1 1 1 1 ...
## $ Internet : int 5 4 4 3 2 4 2 5 1 5 ...
## $ PC : int 3 4 2 1 2 4 1 4 1 1 ...
## $ Economy.Management : int 5 5 4 2 2 1 3 1 1 4 ...
## $ Biology : int 3 1 1 3 3 4 5 2 3 2 ...
## $ Chemistry : int 3 1 1 3 3 4 5 2 1 1 ...
## $ Reading : int 3 4 5 5 5 3 3 2 5 4 ...
## $ Geography : int 3 4 2 4 2 3 3 3 1 4 ...
## $ Foreign.languages : int 5 5 5 4 3 4 4 4 1 5 ...
## $ Medicine : int 3 1 2 2 3 4 5 1 1 1 ...
## $ Law : int 1 2 3 5 2 3 3 2 1 1 ...
## $ Cars : int 1 2 1 1 3 5 4 1 1 1 ...
## $ Art.exhibitions : int 1 2 5 5 1 2 1 1 1 4 ...
## $ Religion : int 1 1 5 4 4 2 1 2 2 4 ...
## $ Countryside..outdoors : int 5 1 5 1 4 5 4 2 4 4 ...
## $ Dancing : int 3 1 5 1 1 1 3 1 1 5 ...
## $ Musical.instruments : int 3 1 5 1 3 5 2 1 2 3 ...
## $ Writing : int 2 1 5 3 1 1 1 1 1 1 ...
## $ Passive.sport : int 1 1 5 1 3 5 5 4 4 4 ...
## $ Active.sport : int 5 1 2 1 1 4 3 5 1 4 ...
## $ Gardening : int 5 1 1 1 4 2 3 1 1 1 ...
## $ Celebrities : int 1 2 1 2 3 1 1 3 5 2 ...
## $ Shopping : int 4 3 4 4 3 2 3 3 2 4 ...
## $ Science.and.technology : int 4 3 2 3 3 3 4 2 1 3 ...
## $ Theatre : int 2 2 5 1 2 1 3 2 5 5 ...
## $ Fun.with.friends : int 5 4 5 2 4 3 5 4 4 5 ...
## $ Adrenaline.sports : int 4 2 5 1 2 3 1 2 1 2 ...
## $ Pets : int 4 5 5 1 1 2 5 5 1 2 ...
## $ Flying : int 1 1 1 2 1 3 1 3 2 4 ...
## $ Storm : int 1 1 1 1 2 2 3 2 3 5 ...
## $ Darkness : int 1 1 1 1 1 2 2 4 1 4 ...
## $ Heights : int 1 2 1 3 1 2 1 3 5 5 ...
## $ Spiders : int 1 1 1 5 1 1 1 1 5 3 ...
## $ Snakes : int 5 1 1 5 1 2 5 5 5 4 ...
## $ Rats : int 3 1 1 5 2 2 1 3 2 4 ...
## $ Ageing : int 1 3 1 4 2 1 4 1 2 3 ...
## $ Dangerous.dogs : int 3 1 1 5 4 1 1 2 3 5 ...
## $ Fear.of.public.speaking : int 2 4 2 5 3 3 1 4 4 3 ...
## $ Smoking : chr "never smoked" "never smoked" "tried smoking" "former smoker" ...
## $ Alcohol : chr "drink a lot" "drink a lot" "drink a lot" "drink a lot" ...
## $ Healthy.eating : int 4 3 3 3 4 2 4 2 1 3 ...
## $ Daily.events : int 2 3 1 4 3 2 3 3 1 4 ...
## $ Prioritising.workload : int 2 2 2 4 1 2 5 1 2 2 ...
## $ Writing.notes : int 5 4 5 4 2 3 5 3 1 2 ...
## $ Workaholism : int 4 5 3 5 3 3 5 2 4 3 ...
## $ Thinking.ahead : int 2 4 5 3 5 3 3 4 2 3 ...
## $ Final.judgement : int 5 1 3 1 5 1 3 3 5 5 ...
## $ Reliability : int 4 4 4 3 5 3 4 3 5 4 ...
## $ Keeping.promises : int 4 4 5 4 4 4 5 3 4 5 ...
## $ Loss.of.interest : int 1 3 1 5 2 3 3 1 1 3 ...
## $ Friends.versus.money : int 3 4 5 2 3 2 4 4 4 4 ...
## $ Funniness : int 5 3 2 1 3 3 4 4 2 3 ...
## $ Fake : int 1 2 4 1 2 1 1 2 2 1 ...
## $ Criminal.damage : int 1 1 1 5 1 4 2 1 1 2 ...
## $ Decision.making : int 3 2 3 5 3 2 2 3 4 5 ...
## $ Elections : int 4 5 5 5 5 5 5 5 1 5 ...
## $ Self.criticism : int 1 4 4 5 5 4 3 3 3 4 ...
## $ Judgment.calls : int 3 4 4 4 5 4 5 5 2 5 ...
## $ Hypochondria : int 1 1 1 3 1 1 1 2 2 1 ...
## $ Empathy : int 3 2 5 3 3 4 4 1 5 4 ...
## $ Eating.to.survive : int 1 1 5 1 1 2 1 2 1 1 ...
## $ Giving : int 4 2 5 1 3 3 5 3 1 4 ...
## $ Compassion.to.animals : int 5 4 4 2 3 5 5 5 4 5 ...
## $ Borrowed.stuff : int 4 3 2 5 4 5 5 2 5 4 ...
## [list output truncated]
Secondly, we will split each category in dataframes for a better management
# Split into DFs
music_dataframe = data.frame(data[1:19])
movies_dataframe = data.frame(data[20:31])
hobbies_dataframe = data.frame(data[32:63])
phobias_dataframe = data.frame(data[64:73])
health_habits_dataframe = data.frame(data[74:76])
personality_dataframe = data.frame(data[78:133])
spendings_dataframe = data.frame(data[134:140])
demographics_dataframe = data.frame(data[141:150])
Now, we will take a look how much people enjoy the Music in general
Firstly create a dataframe of the frequencies for the specific question
overall_music = music_dataframe %>%
group_by(Music) %>%
summarise(counts = n())
overall_music
overall_music_plot = ggplot(overall_music, aes(x = Music, y = counts, fill=counts)) +
geom_bar(stat = "identity", fill="steelblue")+
geom_text(aes(label= counts), vjust = -0.3)+
theme(axis.title.x = element_blank(),
plot.title = element_text(hjust = 0.5))+
labs(y="Number of respondents")+
ggtitle("How much people enjoy Music?")+
geom_text(x=1, y=770, label="1 - Strongly disagree", color="red", hjust = 0, size = 5)+
geom_text(x=1, y=730, label="5 - Strongly agree", color = "blue", hjust = 0, size = 5)+
scale_y_continuous(breaks=round(seq(0, max(overall_music$counts), by=100), 1))
overall_music_plot
Now comes the hard part
We want to convert the music_dataframe into the following table example:
| Question | Strongly_disagree | Disagree | Neutral | Agree | Strongly agree |
|---|---|---|---|---|---|
| Music | 19 | 20 | 31 | 42 | 50 |
| Dance | 11 | 22 | 33 | 44 | 55 |
This table represents the following:
For this, we will take the columns and retain it into a variable
## Convert columns to Rows
Question = colnames(music_dataframe)
Question
## [1] "Music" "Slow.songs.or.fast.songs"
## [3] "Dance" "Folk"
## [5] "Country" "Classical.music"
## [7] "Musical" "Pop"
## [9] "Rock" "Metal.or.Hardrock"
## [11] "Punk" "Hiphop..Rap"
## [13] "Reggae..Ska" "Swing..Jazz"
## [15] "Rock.n.roll" "Alternative"
## [17] "Latino" "Techno..Trance"
## [19] "Opera"
Then we create a dataframe with the column “Question” and add as values each Category from the Field of Music After that, we will add as columns the available responses with NA values (since this needs to be preprocessed first)
df = data.frame(Question)
df = df %>%
add_column("Strongly_disagree" = NA,
"Disagree" = NA,
"Neutral" = NA,
"Agree" = NA,
"Strongly_agree" = NA)
df
Now we will count and group each occurrence for each response (1-5) and store it in occurrences table This table will be unzipped then because we are interested only in the values, not in the columns
# Count occurrences of each value in each column
occurrences <- lapply(music_dataframe, table)
values <- lapply(occurrences, function(x) as.vector(unname(x)))
print(occurrences)
## $Music
##
## 1 2 3 4 5
## 9 11 36 129 822
##
## $Slow.songs.or.fast.songs
##
## 1 2 3 4 5
## 23 66 580 235 104
##
## $Dance
##
## 1 2 3 4 5
## 94 215 321 235 141
##
## $Folk
##
## 1 2 3 4 5
## 290 336 235 87 57
##
## $Country
##
## 1 2 3 4 5
## 343 346 198 85 33
##
## $Classical.music
##
## 1 2 3 4 5
## 138 249 281 189 146
##
## $Musical
##
## 1 2 3 4 5
## 195 248 283 166 116
##
## $Pop
##
## 1 2 3 4 5
## 60 155 260 314 218
##
## $Rock
##
## 1 2 3 4 5
## 55 105 207 294 343
##
## $Metal.or.Hardrock
##
## 1 2 3 4 5
## 384 219 164 136 104
##
## $Punk
##
## 1 2 3 4 5
## 319 227 220 152 84
##
## $Hiphop..Rap
##
## 1 2 3 4 5
## 212 206 206 224 158
##
## $Reggae..Ska
##
## 1 2 3 4 5
## 179 252 285 195 92
##
## $Swing..Jazz
##
## 1 2 3 4 5
## 204 231 271 198 100
##
## $Rock.n.roll
##
## 1 2 3 4 5
## 116 191 298 231 167
##
## $Alternative
##
## 1 2 3 4 5
## 213 220 247 172 151
##
## $Latino
##
## 1 2 3 4 5
## 191 250 238 172 151
##
## $Techno..Trance
##
## 1 2 3 4 5
## 380 207 194 140 82
##
## $Opera
##
## 1 2 3 4 5
## 389 294 175 98 53
values
## $Music
## [1] 9 11 36 129 822
##
## $Slow.songs.or.fast.songs
## [1] 23 66 580 235 104
##
## $Dance
## [1] 94 215 321 235 141
##
## $Folk
## [1] 290 336 235 87 57
##
## $Country
## [1] 343 346 198 85 33
##
## $Classical.music
## [1] 138 249 281 189 146
##
## $Musical
## [1] 195 248 283 166 116
##
## $Pop
## [1] 60 155 260 314 218
##
## $Rock
## [1] 55 105 207 294 343
##
## $Metal.or.Hardrock
## [1] 384 219 164 136 104
##
## $Punk
## [1] 319 227 220 152 84
##
## $Hiphop..Rap
## [1] 212 206 206 224 158
##
## $Reggae..Ska
## [1] 179 252 285 195 92
##
## $Swing..Jazz
## [1] 204 231 271 198 100
##
## $Rock.n.roll
## [1] 116 191 298 231 167
##
## $Alternative
## [1] 213 220 247 172 151
##
## $Latino
## [1] 191 250 238 172 151
##
## $Techno..Trance
## [1] 380 207 194 140 82
##
## $Opera
## [1] 389 294 175 98 53
Then we manually add these values for each ROW starting from the second column until the end P.S.: I added manually because the ‘for’ loop didn’t want to help me
### INSERT VALUES FOR THE COLUMNS OF EACH QUESTION
df[1, 2:(1 + length(values$Music))] <- c(values$Music, df[2, -(1:(1 + length(values$Music)))])
df[2, 2:(1 + length(values$Slow.songs.or.fast.songs))] <- c(values$Slow.songs.or.fast.songs, df[2, -(1:(1 + length(values$Slow.songs.or.fast.songs)))])
df[3, 2:(1 + length(values$Dance))] <- c(values$Dance, df[2, -(1:(1 + length(values$Dance)))])
df[4, 2:(1 + length(values$Folk))] <- c(values$Folk, df[2, -(1:(1 + length(values$Folk)))])
df[5, 2:(1 + length(values$Country))] <- c(values$Country, df[2, -(1:(1 + length(values$Country)))])
df[6, 2:(1 + length(values$Classical.music))] <- c(values$Classical.music, df[2, -(1:(1 + length(values$Classical.music)))])
df[7, 2:(1 + length(values$Musical))] <- c(values$Musical, df[2, -(1:(1 + length(values$Musical)))])
df[8, 2:(1 + length(values$Pop))] <- c(values$Pop, df[2, -(1:(1 + length(values$Pop)))])
df[9, 2:(1 + length(values$Rock))] <- c(values$Rock, df[2, -(1:(1 + length(values$Rock)))])
df[10, 2:(1 + length(values$Metal.or.Hardrock))] <- c(values$Metal.or.Hardrock, df[2, -(1:(1 + length(values$Metal.or.Hardrock)))])
df[11, 2:(1 + length(values$Punk))] <- c(values$Punk, df[2, -(1:(1 + length(values$Punk)))])
df[12, 2:(1 + length(values$Hiphop..Rap))] <- c(values$Hiphop..Rap, df[2, -(1:(1 + length(values$Hiphop..Rap)))])
df[13, 2:(1 + length(values$Reggae..Ska))] <- c(values$Reggae..Ska, df[2, -(1:(1 + length(values$Reggae..Ska)))])
df[14, 2:(1 + length(values$Swing..Jazz))] <- c(values$Swing..Jazz, df[2, -(1:(1 + length(values$Swing..Jazz)))])
df[15, 2:(1 + length(values$Rock.n.roll))] <- c(values$Rock.n.roll, df[2, -(1:(1 + length(values$Rock.n.roll)))])
df[16, 2:(1 + length(values$Alternative))] <- c(values$Alternative, df[2, -(1:(1 + length(values$Alternative)))])
df[17, 2:(1 + length(values$Latino))] <- c(values$Latino, df[2, -(1:(1 + length(values$Latino)))])
df[18, 2:(1 + length(values$Techno..Trance))] <- c(values$Techno..Trance, df[2, -(1:(1 + length(values$Techno..Trance)))])
df[19, 2:(1 + length(values$Opera))] <- c(values$Opera, df[2, -(1:(1 + length(values$Opera)))])
df
For the plot creation we will choose a stacked bar chart, since we have discrete values (number of occurrences) combined with categorical values (Strongly disagree-Strongly agree) and nominal values (Music, Dance, Folk, Opera, etc.)
For this we will transform the dataframe into a long format using tidyr
df_long <- tidyr::gather(df, Answer, Occurrences, -Question)
df_long
Now we want also to calculate the percentages of each response from each Category
# Calculate percentages
df_long <- transform(df_long, Percentage = Occurrences / tapply(Occurrences, Question, sum)[Question] * 100)
# Reorder the levels for Answer column
df_long$Answer <- factor(df_long$Answer, levels = c("Strongly_agree", "Agree", "Neutral", "Disagree", "Strongly_disagree"))
df_long
And now we will create the wanted stacked bar chart
# Horizontal Stacked Bar Chart
stacked_music_field <- ggplot(df_long, aes(x = Percentage, y = Question, fill = Answer, text = paste("Answer: ", Answer, "<br>Occurrences: ", Occurrences, "<br>Percentage: ", Percentage, "%"))) +
geom_bar(stat = "identity") +
geom_text(aes(label = ifelse(Percentage < 3, "", paste0(round(Percentage, 1), "%"))),
position = position_stack(vjust = 0.5), color = "blue", size = 3.5) +
geom_text(aes(label = Occurrences), size = 0, alpha = 0) + # Add hidden Occurrences
labs(x = "Percentage", y = "Category") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("How much people enjoy the following Categories?")+
scale_fill_manual(name = "Scale", values = c("seagreen4", "green", "grey", "chocolate1", "tomato2"), labels = c("Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"))
# Convert ggplot to a plotly plot
stacked_music_field <- ggplotly(stacked_music_field, tooltip = "text")
# Modify the tooltip text
stacked_music_field$x$data[[1]]$text <- df_long$Answer
# Print the plot
stacked_music_field
Note: As we seen from first section, the Question: “Do you prefer Slow songs or Fast songs?” is valued as the following:
This project helped me to draw some important conclusion during the coding process: